\(\newcommand{\mathds}[1]{\mathrm{I\hspace{-0.7mm}#1}}\) \(\newcommand{\bm}[1]{\boldsymbol{#1}}\) \(\newcommand{\bms}[1]{\boldsymbol{\scriptsize #1}}\) \(\newcommand{\proper}[1]{\text{#1}}\) \(\newcommand{\pE}{\proper{E}}\) \(\newcommand{\pV}{\proper{Var}}\) \(\newcommand{\pCov}{\proper{Cov}}\) \(\newcommand{\pACF}{\proper{ACF}}\) \(\newcommand{\I}{\bm{\mathcal{I}}}\) \(\newcommand{\wh}[1]{\widehat{#1}}\) \(\newcommand{\wt}[1]{\widetilde{#1}}\) \(\newcommand{\pP}{\proper{P}}\) \(\newcommand{\pAIC}{\textsf{AIC}}\) \(\DeclareMathOperator{\diag}{diag}\)

1  Probability Theory for Data Scientists

1.1 Set theory Concepts

NoteDefinition: Sample Space, Event, and Empty Set

Definition 1.1 Consider an uncertain scenario. This include a random experiment, a data-generating process or simply the future. We define the following concepts:

  • Sample Space (\(\Omega\)): The set of all possible outcomes or results from the scenario. Sample spaces can be either countable or uncountable. If the elements of a sample space can be put into one-to-one correspondence with the set of integers, the sample space is countable. If the sample space contains only a finite number of elements, it is also countable. Otherwise is uncountable.

  • Event: A subset of the sample space. It represents a specific outcome or a collection of outcomes of interest.

  • Empty Set (\(\emptyset\)): A set containing no elements. It represents an impossible event.

Example 1.1 If we flip a coin twice then the sample space can be written as: \[ \Omega =\{HH,HT,TH,TT\} \]

where \(H\) represents heads and \(T\) tails. This sample space is finite. An event (say \(A\)) could be at least one head appears, that is

\[A =\{HT,TH,HH\}\subset \Omega\]

Example 1.2 If we are analyzing customer purchase behavior for a single online transaction, the sample space could be the set of all possible combinations of items a customer might select from the store’s catalog. This sample space is in principle finite and therefore countable. However, if the catalog is very large, the sample space can be considered uncontably large for practical purposes. More on this later.

An event could be “customer buys at least one item from category X”, or “customer buys product Y”.

Example 1.3 We measure the time (in seconds) it takes for a user to complete a task on a website. The time limit is predefined at 5 minutes. Then the sample space is \(\Omega = \{0, 1, 2, 3,\ldots, 300\}\) which is finite. If, however, we measure the time with arbitrary precision, then the sample space is the interval \((0,300)\) of real numbers. This sample space is uncountable.

An event could be “user completes the task in under 2 minutes”. In the former case, this corresponds to the set \(A=\{1,2\ldots, 119 \}\). Int he latter case is the real interval \(A=(0, 120)\).

Events can be described in many different ways. We will use set theory and notation to describe events and operations on events. This can help later in the computation of probabilities.

NoteBasic Set Operations

Definition 1.2 Given events \(A,B,C\) in the sample space \(\Omega\):

  • Union (\(A \cup B\)): The event that \(A\) occurs, or \(B\) occurs, or both occur.

  • Intersection (\(A \cap B\)): The event that both \(A\) and \(B\) occur.

  • Complement (\(A^c\)): The event that \(A\) does not occur. It is the set of all outcomes in \(\Omega\) that are not in \(A\).

The following propertties hold for any events \(A, B, C\):

  • Commutativity:
    • Union: \(A \cup B = B \cup A\)
    • Intersection: \(A \cap B = B \cap A\)
  • Associativity:
    • Union: \((A \cup B) \cup C = A \cup (B \cup C)\)
    • Intersection: \((A \cap B) \cap C = A \cap (B \cap C)\)
  • Distributive Laws:
    • Intersection over Union: \(A \cap (B \cup C) = (A \cap B) \cup (A \cap C)\)
    • Union over Intersection: \(A \cup (B \cap C) = (A \cup B) \cap (A \cup C)\)
  • De Morgan’s Laws:
    • \((A \cup B)^c = A^c \cap B^c\)
    • \((A \cap B)^c = A^c \cup B^c\)
NoteDisjoint Sets and Partitions of Sample Space

Definition 1.3  

  • Disjoint Sets (Mutually Exclusive Events): Two sets \(A\) and \(B\) are disjoint if they have no elements in common, i.e., \(A \cap B = \emptyset\).

  • Partition: A collection of non-empty, disjoint subsets (events) of \(\Omega\) whose union is \(\Omega\). That is \(A_1,A_2, \ldots\) is a partition if

\[ \bigcup_{i} A_i = \Omega \quad \text{and} \quad A_i \cap A_j = \emptyset \text{ for } i \ne j \]

NoteRepresentation of events using set operations

Example 1.4 When we flip a coin twice, the event \(A\) “at least one head appears” can be written in various ways. These include:

  • the union of three events \(A = \{HT\} \cup \{TH\} \cup \{HH\}\). That is, \(A\) occurs if we get heads on the first flip and tails on the second flip, or tails on the first flip and heads on the second flip, or heads on both flips. Note that these three events are disjoint as they do not share any outcomes.

  • the union \(A = A_1 \cup A_2\) where \(A_1 = \{HT, HH\}\) is the event “head on first flip” and \(A_2 = \{TH, HH\}\) is the event “head on second flip”. Note that \(A_1\) and \(A_2\) are not disjoint as they both contain the outcome \(HH\).

  • the complement \(A = B^c\) where \(B = \{TT\}\) is the event “no heads appear”.

Three different partitions of the sample space are given by:

  • The trivial partition where each event contains a single outcome: \[ \mathcal P_1=\{\{HT\},\{TH\},\{HH\},\{TT\}\} \]

  • The partition: \[ \mathcal P_{equal}=\{\{HH,TT\}, \{HT,TH\}\} \] that is, when we flip the coin twice, either we get the same results in both throws OR different ones.

  • The partition where we group the outcomes based on the number of heads:

\[ \mathcal P_{heads} =\{\{TT\}, \{HT,TH\}, \{HH\}\} \]

that is, when we flip the coin twice, we can get no heads, one head or two heads.

NoteSigma Algebra

Definition 1.4 A collection \(\mathcal{F}\) of subsets of \(\Omega\) is a sigma algebra (or \(\sigma\)-algebra) if it satisfies the following properties:

  1. \(\Omega \in \mathcal{F}\) (The sample space is in the collection).
  2. If \(A \in \mathcal{F}\), then \(A^c \in \mathcal{F}\) (The collection is closed under complementation).
  3. If \(A_1, A_2, \dots\) are in \(\mathcal{F}\), then \[ \bigcup_i A_i \in \mathcal{F} \]

that is, the collection is closed under arbitray number of unions.

Note the definition of sigma-algebra does not explicitly require that the intersection of two sets in \(\mathcal F\) is also in \(\mathcal F\). However, this property follows from the other properties and De Morgan’s laws. If \(A,B \in \mathcal F\), then

\[ \cancel{A\cup B \in \mathcal F \implies (A\cup B)^c = A^c \cap B^c \in \mathcal F \implies (A^c \cap B^c)^c = A \cup B \in \mathcal F} \]

\[ A^c \in \mathcal F\,,B^c \in \mathcal F \implies A^c \cup B^c \in \mathcal F \implies (A^c \cup B^c)^c= A \cap B\in \mathcal F \]

(corrected from previous version)

NoteExamples of sigma-algebras

Example 1.5 The trivial sigma algebra is clearly \(\mathcal F_0=\{\emptyset, \Omega\}\) which does not seem very useful.

The partition \(\mathcal P_{equal}=\{\{HH,TT\}, \{HT,TH\}\}\) above, is not a sigma-algebra as it does not contain the empty set. If we add the empty set, then is still not a sigma algebra as it is not closed under union. The union of the only two elements is \(\Omega\). If we include \(\Omega\) then we have the sigma algebra:

\[ \mathcal F_{equal}=\{\emptyset, \Omega, \{HH,TT\}, \{HT,TH\}\} \]

The partition \(\mathcal P_{heads}\) above is also not a sigma algebra but if we add all possible unions then we obtain the sigma algebra:

\[ \begin{aligned} \mathcal F_{heads}& =\{\emptyset, \Omega, \{TT\}, \{HT,TH\}, \{HH\}, \{HT,TH,HH\},\\ & \{HT,TH,TT\}, \{HH,TT\}\} \end{aligned} \]

The set \[ \mathcal G =\{\emptyset,\Omega,\{HT\},\{TH\},\{HH\},\{TT\}\} \]

obatined from \(\mathcal P_1\) is neither a partition nor a sigma algebra as it is not closed under union. For example, \(\{HT\}\cup \{TH\}=\{HT,TH\}\notin \mathcal G\). However, if we add all possible unions of the elements of \(\mathcal G\) we obtain the power set of \(\Omega\), that is the set of all subsets of \(\Omega\):

\[ \begin{aligned} \mathcal F_{max} &= \{\emptyset, \{HT\},\{TH\},\{HH\},\{TT\},\\ & \{HT,TH\}, \{HT,HH\}, \{HT,TT\}, \{TH,HH\}, \{TH,TT\}, \{HH,TT\}\\ & \{HT,TH,HH\}, \{HT,TH,TT\}, \{HT,HH,TT\}, \{TH,HH,TT\},\Omega \} \end{aligned} \]

This is the largest possible sigma-algebra for this sample space. It has \(2^4=16\) elements since the sample space has 4 elements. In general, if the sample space has \(n\) elements, then its power set has \(2^n\) elements.

Also generally, if we have a finite partition of \(\Omega\) then the collection of all unions of sets in the partition (including the empty set) is a sigma-algebra.

Note that different sigma algebras serve for different purposes. For example, the sigma algebra \(\mathcal F_{equal}\) is useful if we are only interested in whether the two coin flips are the same or different. The sigma algebra \(\mathcal F_{heads}\) is useful if we areinterested in the number of heads. The power set \(\mathcal F_{max}\) is a sigma algebra that me be more useful if we are interested in all possible events.

In this example we also observe that:

\[ \mathcal F_0 \subset \mathcal F_{equal} \subset \mathcal F_{heads} \subset \mathcal F_{max} \]

so that \(\mathcal F_0\) and \(\mathcal F_{max}\) are the smallest and largest sigma algebras possible for this sample space.

1.2 Probability

We will start by defining probability in an intuitive way. Later we will give a more formal mathematical definition .

1.2.1 Types of Probability

There are several ways to think about probability. These include

  • Classical Probability: Assumes all possible outcomes in a finite sample space are equally likely. That is, for any event \(A\) with \(n(A)\) outcomes in a sample space \(\Omega\) with \(n(\Omega)\) equally likely outcomes, the probability of \(A\) is: \[ P(A) = \frac{n(A)}{n(\Omega)} \]

    Example 1.6 Under this framework, the probability of rolling an even number on a die is assigned to be \(P(\text{rolling an even number}) = \frac{3}{6}\). More, generally this is equivalent to say the die is fair. Another example is when we assign the probability of rain tomorrow, locally at 10 AM, to be 1/2 as there are only two possible outcomes: rain or no rain.

  • Empirical (or Frequentist) Probability: Based on observed frequencies from repeated experiments. As the number \(N\) of experiment repetitions increases, the probability of an event \(A\) approaches the true probability: \[ P(A) \approx \frac{\text{Number of times A occurred}}{N} \]

    Example 1.7 If we do not what the probability of heads when flipping a coin is. We can we flip the coin 1000 times and if it lands heads 537 times, we would say the empirical probability of heads is \(0.537\). Furthermore we might say that the true probability of heads is \(\approx 0.537\) and the important aspect of thios framework is that, in theory, the more times we flip the coin the closer the empirical proportion will be to the true probability. Finally, according to historical data for our location, it has rained 33.6% of the days out of the last 10 years. The empirical probability of rain tomorrow locally at 10 AM is 0.336.

  • Subjective Probability: Based on personal belief or judgment, often used when objective data is scarce.

    Example 1.8 I had a look through the window and is a bit overcast, then I believe the probability of rain tomorrow locally at 10 AM is 0.7. On the other hand, if I am a weather expert from the point of atmospheric physics, I might believe the probability of rain tomorrow locally at 10 AM is 0.9.

1.2.2 Formal definition of probability

After we have chosen a sigma algebra \(\mathcal F\) that contains events we are interested in, we can define probabilities for all the events in a more formal way.

NoteProbability Measure (Kolmogorov’s Axioms)

Definition 1.5 A probability measure \(P\) on a sample space \(\Omega\) with a \(\sigma\)-algebra \(\mathcal{F}\) is a function \(P: \mathcal{F} \to [0, 1]\) that assigns a probability to each event in \(\mathcal{F}\) and satisfies the following three axioms:

  1. Non-negativity: For any event \(A \in \mathcal{F}\), \(P(A) \ge 0\). The probability of any event is non-negative.

  2. Normalization: \(P(\Omega) = 1\). The probability of the entire sample space (the certain event) is 1.

  3. Additivity (for disjoint events): If \(A_1, A_2, \dots, A_n\) are disjoint events in \(\mathcal{F}\) (i.e., \(A_i \cap A_j = \emptyset\) for \(i \ne j\)), then \[ P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) \] For a countably infinite sequence of disjoint events, this extends to: \[ P\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i) \] The probability of the union of disjoint events is the sum of their individual probabilities.

NoteProbability measure for the equality of two coin flips

Example 1.9 For the sigma algebra \(\mathcal F_{equal}=\{\emptyset, \Omega, \{HH,TT\}, \{HT,TH\}\}\) we can define a probability measure simply by specifying:

  • \(P(\emptyset) = 0\)
  • \(P(\{HH,TT\}) = 0.4\)

Note we can compute the probability of the other two events in \(\mathcal F_{equal}\) using the axioms:

  • \(P(\Omega) = 1\) (by axiom 2)

  • \(P(\{HT,TH\}) = P(\Omega) - P(\{HH,TT\}) = 1 - 0.4 = 0.6\) (by axiom 3)

The assignment of probability of \(\{HH,TT\}\) to be \(0.4\) maybe frequentist or subjective but regardless of this, it generates is a valid probability measure as it satisfies all three axioms.

NoteProbability measure for the number of heads in two coin flips

Example 1.10 For the sigma algebra \(\mathcal F_{heads}\) we can define a probability measure simply by specifying:

  • \(P(\{HT,TH\}) = 0.5\)
  • \(P(\{TT\}) = 0.1\)

The probabilities for the rest of the event in \(\mathcal F_{heads}\) can be computed using axiom 3 as follows:

  • \(P(\{HH\}) = 1-0.1-0.5 = 0.4\)
  • \(P(\{HT,TH,HH\}) = P(\{HT,TH\}) + P(\{HH\}) = 0.5 + 0.4 = 0.9\)
  • \(P(\{HT,TH,TT\}) = P(\{HT,TH\}) + P(\{TT\})= 0.5 + 0.1 =0.6\)
  • \(P(\{HH,TT\}) = 0.1 +0.4 = 0.5\)
  • \(P(\Omega) = 1\) (Trivial but good to double check in practice)
  • \(P_{heads}(\emptyset) = 1-1=0\) (Trivial, always true)

As before the probability assignment maybe frequentist or subjective but regardless of this, it generates is a valid probability measure as it satisfies all three axioms.

NoteProbability measure for power set

Example 1.11 For the largest sigma algebra \(\mathcal F_{max}\) we can define a probability measure simply by specifying probabilities for the four singletons or atoms:

  • \(P(\{HH\}) = 0.3\)
  • \(P(\{HT\}) = 0.2\)
  • \(P(\{TH\}) = 0.4\)

The probabilities for the rest of the events in \(\mathcal F_{max}\) can be computed using the axioms as follows:

  • \(P(\{TT\}) = 1-0.3-0.2-0.4 = 0.1\)
  • \(P(\{HT,TH\}) = 0.2 + 0.4 = 0.6\)
  • \(P(\{HT,HH\}) = 0.2 + 0.3 = 0.5\)
  • \(P(\{HT,TT\}) = 0.2 + 0.1 = 0.3\)
  • \(P(\{TH,HH\}) = 0.4 + 0.3 = 0.7\)
  • \(P(\{TH,TT\}) = 0.4 + 0.1 = 0.5\)
  • \(P(\{HH,TT\}) = 0.3 + 0.1 = 0.4\)
  • \(P(\{HT,TH,HH\}) = 0.2 + 0.4 + 0.3 = 0.9\)
  • \(P(\{HT,TH,TT\}) = 0.2 + 0.4 + 0.1 = 0.7\)
  • \(P(\{HT,HH,TT\}) = 0.2 + 0.3 + 0.1 = 0.6\)
  • \(P(\{TH,HH,TT\}) = 0.4 + 0.3 + 0.1 = 0.8\)

As before the probability assignment maybe frequentist or subjective but regardless of this, it generates is a valid probability measure as it satisfies all three axioms.

NoteSimple Probability Operations

Proposition 1.1 From the axioms, we can derive several useful properties:

  • Probability of the Complement: For any event \(A \in \mathcal{F}\), \[ P(A^c) = 1 - P(A) \]

  • Probability of the empty set: \(P(\emptyset) = 0\).

  • Probability of the Union of Two Events (General): For any two events \(A, B \in \mathcal{F}\): \[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \] This is known as the addition rule. It accounts for the overlap between events.

NoteProbability of the union

Example 1.12 \[ \begin{aligned} P(\{HT,TH,TT\}\cup\{HH,TT\})&=P(\{HT,TH,TT\})+P(\{HH,TT\})-P(\{TT\})\\ & = 0.6+0.5-0.1\\ & =1 \end{aligned} \]

clearly this is correct as the union of these two events is \(\Omega\).

NoteBoole and Bonferroni inequalities

Theorem 1.1  

  • Boole’s inequality For any events \(A_1, A_2, \ldots, A_n\) in \(\mathcal F\): \[ P\left(\bigcup_{i=1}^n A_i\right) \le \sum_{i=1}^n P(A_i) \] This inequality provides an upper bound for the probability of the union of events.

    Bonferroni Inequality: For any events \(A_1, A_2, \ldots, A_n\) in \(\mathcal F\): \[ P\left(\bigcap_{i=1}^n A_i\right) \ge 1 - \sum_{i=1}^n P(A_i^c) \] This inequality provides a lower bound for the probability of the intersection of events.

These inequalities, specially Bonferroni’s will be useful later. Booles inequality can be proved by induction and Bonferroni’s inequality follows from Booles inequality and the properties of complements. These facts can be verified by the reader.

1.3 Conditional Probability

NoteConditional Probability

Definition 1.6 The conditional probability of event \(A\) given that event \(B\) has occurred, denoted \(P(A|B)\), is defined as: \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \] provided that \(P(B) > 0\). This measures the probability of event \(A\) occurring, knowing that event \(B\) has already happened.

NoteExample of Conditional Probability

Example 1.13 What is the probability of getting heads on the first coin flip GIVEN that at least one head appears in two flips? This can be expressed as \(P(A|B)\) where \(A=\{HT,HH\}\) is “head on first flip” and \(B=\{HT,TH,HH\}\) is “at least one head appears”. We have:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} =\frac{P(A)}{P(B)}= \frac{P(\{HT,HH\})}{P(\{HT,TH,HH\})} \]

since \(A\subset B\) in this case. We notice a subtlety here. The event \(A =\{HT,HH\}\) (head on the first flip) is not a member of the sigma-algebra \(\mathcal F_{heads}\). So cannot use the probability measure \(P\) from Example 1.10 to compute this conditional probability. However, it is a member of the sigma algebra (the power set) \(\mathcal F_{max}\) so we might need to use the probabilities using such sigma algebra (see Example 1.11) as follows:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{HT,HH\})}{P(\{HT,TH,HH\})} = \frac{0.5}{0.9} \approx 0.556 \]

On a more practical situation, if \(A\) is “a user makes a purchase” and \(B\) is “a user clicks on an advertisement”, then \(P(A|B)\) is the probability that a user makes a purchase GIVEN that they clicked on the advertisement. This is a key metric for evaluating ad campaign effectiveness.

Two very useful consequences of the above are: the law of total probability that combines the notion of partition with that of conditional probability and Bayes rule that allows us to reverse conditional probabilities.

NoteLaw of Total Probability

Proposition 1.2 Let \(A_1, A_2, \dots\) be a partition of the sample space \(\Omega\) (recall Definition 1.3). Then for any event \(B \in \mathcal{F}\): \[ P(B) = \sum_{i} P(B|A_i) P(A_i) \]

NoteBayes’ Rule

Proposition 1.3 For events \(A\) and \(B\) where \(P(B) > 0\): \[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} \] If \(A_1, \dots, A_n\) form a partition of \(\Omega\), and \(P(A_i) > 0\) for all \(i\), then Bayes’ Rule can be written using the Law of Total Probability for the denominator: \[ P(A|B) = \frac{P(B|A) P(A)}{\sum_{i=1}^n P(B|A_i) P(A_i)} \]

The proof of this result is staightforward and left to the reader.

NoteExample of Law of Total Probability and Bayes’ Rule

Example 1.14 (Medical Testing) Suppose a rare disease affects 1 in 10,000 people. A test for this disease is 99% accurate:

  • If a person has the disease, the test correctly identifies it 99% of the time (True Positive).
  • If a person does not have the disease, the test correctly identifies it 99% of the time (True Negative).

Let \(D\) be the event that a person has the disease, and \(T^+\) be the event that the test is positive. The probabilities we know are:

  • \(P(D) = \frac{1}{10000} = 0.0001\) (Prevalence)
  • \(P(T^+|D) = 0.99\) (Sensitivity - True Positive Rate)
  • \(P(T^-|D^c) = 0.99\) (Specificity - True Negative Rate)

Before we proceed we note the probability specifications above are emprical.

Suppose we want to find \(P(D|T^+)\), the probability that a person actually has the disease given a positive test result.

First, we need \(P(T^+)\). A positive test can occur in two ways:

  • (\(D \cap T^+\)) or

  • (\(D^c \cap T^+\))

e.g. a partition of \(A\). We also have:

  • \(P(T^+|D^c) = 1 - P(T^-|D^c) = 1 - 0.99 = 0.01\) (False Positive Rate)
  • \(P(D^c) = 1 - P(D) = 1 - 0.0001 = 0.9999\)

Using the law of total probability:

\[ \begin{aligned} P(T^+) &=P(T^+ \cap D)+P(T^+\cap D^c)\\ & =P(T^+|D)P(D) + P(T^+|D^c)P(D^c)\\ &= (0.99)(0.0001) + (0.01)(0.9999)\\ &= 0.000099 + 0.009999 = 0.010098 \end{aligned} \]

Now, using Bayes’ Theorem: \[ \begin{aligned} P(D|T^+) &= \frac{P(T^+|D) P(D)}{P(T^+)} \\ &= \frac{(0.99)(0.0001)}{0.010098} \approx 0.0098 \end{aligned} \]

This may look counter-intuitive. Even with a positive test, there’s only about a 0.98% (less than 1%) chance the person actually has the disease! In particular it is a rare disease. This highlights the importance of understanding base rates and conditional probabilities in interpreting results.

The Law of Total probability allows us to calculate the probability of an event \(A\) by considering the different ways it can occur through the events in a partition.

NoteCustomer churn

Example 1.15 Suppose we have three models, \(M_1\), \(M_2\), and \(M_3\), that are used to predict customer churn. Let \(P(M_1)=0.5\), \(P(M_2)=0.3\), \(P(M_3)=0.2\) be the probabilities that each model is the “best” for a given customer. Let \(A\) be the event “customer churns”. If we know the probability of churn given each best model (e.g., \(P(A|M_1)=0.1\), \(P(A|M_2)=0.2\), \(P(A|M_3)=0.15\)), the Law of Total Probability allows us to find the overall probability of churn:

\[ \begin{aligned} P(A) &= P(A|M_1)P(M_1) + P(A|M_2)P(M_2) + P(A|M_3)P(M_3) \\ &= (0.1)(0.5) + (0.2)(0.3) + (0.15)(0.2)\\ & = 0.05 + 0.06 + 0.03 = 0.14 \end{aligned} \]

1.4 Independence

1.4.1 Independence of events

First intuitively, two events, \(A\) and \(B\), are considered independent if the occurrence of one event does not affect the probability of the other event occurring. Formally,

NoteIndependent Events

Definition 1.7 Tvents \(A\) and \(B\) are independent if: \[ P(A \cap B) = P(A) \times P(B) \] or equivalently, if either of the following conditions hold:

  • \(P(A|B) = P(A)\)

  • \(P(B|A) = P(B)\)

This means that knowing that event \(B\) has occurred gives us no new information about the probability of event \(A\) occurring, and vice versa.

NoteIndependent event when flipping a coin twice

Consider the following events when flipping a coin twice:

  • \(A=\{HT, HH\}\) the first flip is heads
  • \(B=\{TT, HH\}\) the two flips are the same

Then using the probabilities in Example 1.11 we have:

\[ P(A\cap B) = P(\{HH\}) = 0.3 \neq P(\{HT,HH\})P(\{TT,HH\}) = 0.5\times 0.4 = 0.2 \]

Therefore these two events are not independent. Of course, the assignment of probabilities here play a role. In this way, if we had assigned \(P(\{HH\})=0.2\) then the events would have been independent.

An obvious consequence of Definition 1.6 of conditional probability is the so-called multiplication rule.

NoteMultiplication Rule

Proposition 1.4 For any two events \(A\) and \(B\) \[ P(A \cap B) = P(A) \times P(B|A) = P(B) \times P(A|B) \]

NoteDrawing Cards without Replacement

Example 1.16 Imagine drawing two cards from a standard deck without replacement. Let \(A\) be the event that the first card is a Heart. \(P(A) = \frac{13}{52}\). Let \(B\) be the event that the second card is a Heart. Since the first card is not replaced, these events are dependent. The probability of the second card being a Heart depends on the first card drawn. \(P(B|A)\) (the probability the second card is a Heart, given the first was a Heart) is \(\frac{12}{51}\) (as there are 12 Hearts left and 51 total cards). So, the probability of drawing two Hearts in a row is \(P(A \cap B) = P(A) \times P(B|A) = \frac{13}{52} \times \frac{12}{51}\).

Note

The outcome of flipping a coin maybe independent of the outcome of any previous coin flips. If you flip a coin and get heads, the probability of getting heads on the next flip should remain as before. Of course, this a simplifying assumption that may not hold in practice. In this course we will make these kind of assumption specially when it involves sequences of events. Not assuming independence for sequences of events make things more complicated for what we want to achieve in this course.

1.5 Random Variables

So far we have talked about events, which are subsets of the sample space. In many applications, especially in data science, we are interested in quantifying outcomes numerically. This is where random variables come into play.

NoteRandom Variable

Definition 1.8 A random variable \(X\) is a function that maps outcomes from the sample space \(\Omega\) to real numbers. That is, \(X: \Omega \to \mathbb{R}\). It quantifies the outcomes of a random phenomenon numerically.

NoteRandom variable examples

Example 1.17 If \(\Omega\) is the set of all possible customer orders, a random variable \(X\) could be “the total dollar amount spent in an order”. For each order (an outcome in \(\Omega\)), \(X\) assigns a specific monetary value. As another example: for a user’s session on a website, \(X\) could be “the number of pages visited” or the “overall time spent in the website”.

Note a random variable is a function defined on the sample space \(\Omega\) rather than on a sigma algebra, so for each outcome \(\omega \in \Omega\), there is a corresponding real number \(X(\omega)\). However, events in a sigma algebra can be defined in terms of random variables. For example, the event “the total amount spent in an order is greater than 50 dollars” can be expressed as \(\{X > 50\}\).

The set of all possible values that \(X\) can take is called the image or range of the random variable, denoted as \(X(\Omega)\).

NoteRandom variable: Number of equal coin flips

Example 1.18 When we flip a coin twice, the sample space is \(\Omega = \{TT, HT, TH, HH\}\). We can define a very simple random variable \(X\) as the “number of times the flips are the same”. The mapping would be:

  • \(X(\{TT\}) = 1\) (both flips are the same)
  • \(X(\{HT\}) = 0\) (flips are different)
  • \(X(\{TH\}) = 0\) (flips are different)
  • \(X(\{HH\}) = 1\) (both flips are the same)

The possible values of \(X\) are \(\{0, 1\}\).

NoteRandom variable: Number of heads in two coin flips

Example 1.19 When we flip a coin twice, the sample space is \(\Omega = \{TT, HT, TH, HH\}\). We can define a random variable \(X\) as the “number of heads” in the two flips. The mapping would be:

  • \(X(\{TT\}) = 0\) (no heads)
  • \(X(\{HT\}) = 1\) (one head)
  • \(X(\{TH\}) = 1\) (one head)
  • \(X(\{HH\}) = 2\) (two heads)

The possible values of \(X\) are \(\{0, 1, 2\}\). This random variable quantifies the outcome of the coin flips in terms of the number of heads observed. Also note the order in which the heads appear does not matter for this random variable.

The above two random variables are discrete random variables as they take on a finite or countable number of values, that is \(X(\Omega)\) is finite or countable. There are also continuous random variables that can take on any value in a continuous range.

NoteContinuous vs Discrete Random Variables

Example 1.20 Going back to Example 1.3 we have already defined a random variable \(X\) as the time to complete a task in a website with a limit of 5 minutes. If we round to the nearest second, then the possible values of \(X\) are \(\{0,1, 2, 3, \ldots, 300\}\) and \(X\) is a discrete random variable. However, if we do not round then \(X\) can take any value in the interval \((0,300)\) and \(X\) is a continuous random variable.

The definition of continuous random variables requires a bit more than simply having an uncountably infinite image set \(X(\Omega)\) . The definition is a bit thechnical as it involves the notion of probability density function.

NoteDiscrete and continuous random variables

Definition 1.9 We say a random variable \(X\) is

  • discrete if it takes on a finite or countably infinite number of distinct values. That is if the image set \(X(\Omega)\) is either finite or countably infinite.

The function: \[ f_X(x) = P(X=x):=P(\{\omega\,:\,X(\omega)=x\})\quad \mbox{for } x\in X(\Omega) \]

is called the probability mass function (PMF) of the discrete random variable \(X\). The PMF satisfies:

  • \(f_X(x) \ge 0\) for all \(x \in X(\Omega)\).

  • \(\sum_{x \in X(\Omega)} f_X(x) = 1\)

  • continuous if there exists a function \(f_X(x)\) such that for any two numbers \(a\) and \(b\) with \(a < b\): \[ P(a \le X \le b):=P(\{\omega\,:\, a \leq X(\omega)\leq b\}) = \int_a^b f_X(x) dx \] where

  • \(f_X(x) \ge 0\) for all \(x\) and

  • \(\int_{-\infty}^{\infty} f_X(x) dx = 1\).

The function \(f_X(x)\) is called the probability density function (PDF) of the random variable \(X\).

The idea is that there are no “gaps”, which would correspond to real numbers which have a finite probability of occurring. Instead, continuous random variables never take an exact prescribed value, that is \(P(X=x)=0\) for all \(x\) but there is a positive probability that its value will lie in particular intervals which can be arbitrarily small.

NoteCumulative Distribution Function (CDF)

Definition 1.10 The cumulative distribution function (CDF) of a random variable \(X\), denoted by \(F_X(x)\), is the function \(F: \mathcal R \to [0,1]\) defined by \[ F_X(x) = P(X \le x):=P(\{\omega\,:\,X(\omega)\leq x\}) \]

for any real number \(x\). The CDF gives the probability that the random variable \(X\) takes on a value less than or equal to \(x\).

We note that

\[ \begin{aligned} P(a \leq X <b ) &= F_X(b) - F_X(a)\\ \rule{0in}{4ex} &= \int_a^b f_X(x) dx \end{aligned} \]

so that

\[ f_X(x) = \frac{d}{dx} F_X(x)\qquad \forall x \in \mathbb R \]

NoteCDF of a discrete random variable

Example 1.21 Consider a discrete random variable \(X\) with possible values in the set \(\{0, 1, 2, 3\}\). Assume we probability mass function (pmf) is given by: \[ f_X(x) = P(X=x) = \begin{cases} 0.1 & \text{if } x = 0 \\ 0.3 & \text{if } x = 1 \\ 0.4 & \text{if } x = 2 \\ 0.2 & \text{if } x = 3 \\ 0 & \text{otherwise} \end{cases} \]

then the CDF is given by

\[ F_X(x) = P(X \le x) = \begin{cases} 0 & \text{if } x < 0 \\ 0.1 & \text{if } 0 \le x < 1 \\ 0.4 & \text{if } 1 \le x < 2 \\ 0.8 & \text{if } 2 \le x < 3 \\ 1.0 & \text{if } x \ge 3 \end{cases} \]

We can plot the CDF in Python as follows:

Code
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-1, 4, 1000)
y = np.piecewise(x, [x < 0, (x >= 0) & (x < 1), (x >= 1) & (x < 2), (x >= 2) & (x < 3), x >= 3], [0, 0.1, 0.4, 0.8, 1.0])
plt.step(x, y, where='post')
plt.title('CDF of Discrete Random Variable')
plt.xlabel('x')
plt.ylabel('F(x)')
plt.grid()
plt.yticks(np.array([0,0.1, 0.4, 0.8, 1.0]))
plt.show()

Is a step function with jumps at the points where the random variable takes values.

NoteProperties of Cumulative Distribution Functions

For any random variable \(X\), its CDF \(F_X(x)\) has the following properties:

  1. Monotonicity: \(F_X(x)\) is non-decreasing. For any \(x_1 < x_2\), \(F_X(x_1) \le F_X(x_2)\).
  2. Limits: \(\lim_{x \to -\infty} F_X(x) = 0\) and \(\lim_{x \to \infty} F_X(x) = 1\).
  3. Right-continuity: \(F_X(x)\) is right-continuous, meaning \(\lim_{h \to 0^+} F_X(x+h) = F_X(x)\) for all \(x\).

The properties above hold for both discrete and continuous random variables. For discrete random variables, the CDF is a step function (continuous from the right), while for continuous random variables, the CDF is a continuous function.

NoteCDF of a continuous random variable

Example 1.22 Consider a random variable \(X\) representing the time (in hours) a server remains operational before crashing. \(X\) can take any non-negative real value. Assume the probability density function (pdf) is given by: \[ f_X(x)= \begin{cases} \frac{1}{100} e^{-x/100} & \text{if } x \ge 0 \\ \rule{0in}{3ex}0 & \text{if } x < 0 \end{cases} \]

This probability distribution is called an exponential distribution with a mean of 100 hours. We will define and talk about mean later. The CDF is computed as follows:

\[ F_X(x) = P(X \le x) = \int_{-\infty}^x f_X(t) dt = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x/100} & \text{if } x \ge 0 \end{cases} \]

The CDF \(F_Y(y) = P(Y \le y)\) would give the probability that the server operates for at most \(y\) hours. For instance, \(F_Y(10)\) would be the probability the server fails within the first 10 hours.